{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<center>\n",
    "<img src=\"../../img/ods_stickers.jpg\">\n",
    "## Открытый курс по машинному обучению. Сессия № 2\n",
    "Автор материала: программист-исследователь Mail.ru Group, старший преподаватель Факультета Компьютерных Наук ВШЭ Юрий Кашницкий. Материал распространяется на условиях лицензии [Creative Commons CC BY-NC-SA 4.0](https://creativecommons.org/licenses/by-nc-sa/4.0/). Можно использовать в любых целях (редактировать, поправлять и брать за основу), кроме коммерческих, но с обязательным упоминанием автора материала."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# <center>Тема 10. Бустинг\n",
    "## <center>Часть 7. Xgboost и Hyperopt в соревновании Kaggle Forest Cover Type Prediction"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "[Соревнование](https://www.kaggle.com/c/forest-cover-type-prediction). \n",
    "Задача учебная. Предлагается предсказывать тип лесного покрытия на участках 30х30 метров Национального заповедника Рузвельта в Колорадо.\n",
    "\n",
    "Признаки (подробней на [странице](https://www.kaggle.com/c/forest-cover-type-prediction/data) соревнования):"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "- Elevation (высота) - Elevation in meters\n",
    "- Aspect - Aspect in degrees azimuth\n",
    "- Slope (наклон) - Slope in degrees\n",
    "- Horizontal_Distance_To_Hydrology (горизонтальное расстояние до воды) - Horz Dist to nearest surface water features\n",
    "- Vertical_Distance_To_Hydrology (вертикальное расстояние до воды) - Vert Dist to nearest surface water features\n",
    "- Horizontal_Distance_To_Roadways (горизонтальное расстояние до дорог) - Horz Dist to nearest roadway\n",
    "- Hillshade_9am (0 to 255 index) - Hillshade index at 9am, summer solstice\n",
    "- Hillshade_Noon (0 to 255 index) - Hillshade index at noon, summer solstice\n",
    "- Hillshade_3pm (0 to 255 index) - Hillshade index at 3pm, summer solstice \n",
    "- Horizontal_Distance_To_Fire_Points (горизонтальное расстояние до центров воспламенения) - Horz Dist to nearest wildfire ignition points \n",
    "- Wilderness_Area (4 binary columns, 0 = absence or 1 = presence) - Wilderness area designation\n",
    "- Soil_Type (тип почвы) - (40 binary columns, 0 = absence or 1 = presence) - Soil Type designation\n",
    "Cover_Type (7 types, integers 1 to 7) - Forest Cover Type designation"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Подключаем библиотеки и загружаем данные. Используем [log_progress](https://github.com/alexanderkuk/log-progress) для отслеживания итераций в циклах.**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "import pandas as pd\n",
    "import xgboost as xgb\n",
    "from sklearn.ensemble import RandomForestClassifier\n",
    "from sklearn.metrics import accuracy_score, confusion_matrix\n",
    "from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split\n",
    "\n",
    "%matplotlib inline\n",
    "import matplotlib.pyplot as plt"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Файл forest_test.csv можно скачать [отсюда](https://drive.google.com/file/d/1Ktn5JjFlAjABp6kGDldAYI2aVG-OMgvW/view?usp=sharing)**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "train_df = pd.read_csv(\"../../data/forest_train.csv\")\n",
    "test_df = pd.read_csv(\"../../data/forest_test.csv\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def write_to_submission_file(\n",
    "    predicted_labels, out_file, target=\"Cover_Type\", index_label=\"Id\", init_index=15121\n",
    "):\n",
    "    # turn predictions into data frame and save as csv file\n",
    "    predicted_df = pd.DataFrame(\n",
    "        predicted_labels,\n",
    "        index=np.arange(init_index, predicted_labels.shape[0] + init_index),\n",
    "        columns=[target],\n",
    "    )\n",
    "    predicted_df.to_csv(out_file, index_label=index_label)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Создаем признаки.**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "train_df[\"Under_water\"] = train_df.Vertical_Distance_To_Hydrology < 0\n",
    "test_df[\"Under_water\"] = test_df.Vertical_Distance_To_Hydrology < 0"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "train_df[\"EVDtH\"] = train_df.Elevation - train_df.Vertical_Distance_To_Hydrology\n",
    "test_df[\"EVDtH\"] = test_df.Elevation - test_df.Vertical_Distance_To_Hydrology\n",
    "\n",
    "train_df[\"EHDtH\"] = train_df.Elevation - train_df.Horizontal_Distance_To_Hydrology * 0.2\n",
    "test_df[\"EHDtH\"] = test_df.Elevation - test_df.Horizontal_Distance_To_Hydrology * 0.2"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "train_df[\"Distanse_to_Hydrolody\"] = (\n",
    "    train_df[\"Horizontal_Distance_To_Hydrology\"] ** 2\n",
    "    + train_df[\"Vertical_Distance_To_Hydrology\"] ** 2\n",
    ") ** 0.5\n",
    "test_df[\"Distanse_to_Hydrolody\"] = (\n",
    "    test_df[\"Horizontal_Distance_To_Hydrology\"] ** 2\n",
    "    + test_df[\"Vertical_Distance_To_Hydrology\"] ** 2\n",
    ") ** 0.5\n",
    "\n",
    "train_df[\"Hydro_Fire_1\"] = (\n",
    "    train_df[\"Horizontal_Distance_To_Hydrology\"]\n",
    "    + train_df[\"Horizontal_Distance_To_Fire_Points\"]\n",
    ")\n",
    "test_df[\"Hydro_Fire_1\"] = (\n",
    "    test_df[\"Horizontal_Distance_To_Hydrology\"]\n",
    "    + test_df[\"Horizontal_Distance_To_Fire_Points\"]\n",
    ")\n",
    "\n",
    "train_df[\"Hydro_Fire_2\"] = abs(\n",
    "    train_df[\"Horizontal_Distance_To_Hydrology\"]\n",
    "    - train_df[\"Horizontal_Distance_To_Fire_Points\"]\n",
    ")\n",
    "test_df[\"Hydro_Fire_2\"] = abs(\n",
    "    test_df[\"Horizontal_Distance_To_Hydrology\"]\n",
    "    - test_df[\"Horizontal_Distance_To_Fire_Points\"]\n",
    ")\n",
    "\n",
    "train_df[\"Hydro_Road_1\"] = abs(\n",
    "    train_df[\"Horizontal_Distance_To_Hydrology\"]\n",
    "    + train_df[\"Horizontal_Distance_To_Roadways\"]\n",
    ")\n",
    "test_df[\"Hydro_Road_1\"] = abs(\n",
    "    test_df[\"Horizontal_Distance_To_Hydrology\"]\n",
    "    + test_df[\"Horizontal_Distance_To_Roadways\"]\n",
    ")\n",
    "\n",
    "train_df[\"Hydro_Road_2\"] = abs(\n",
    "    train_df[\"Horizontal_Distance_To_Hydrology\"]\n",
    "    - train_df[\"Horizontal_Distance_To_Roadways\"]\n",
    ")\n",
    "test_df[\"Hydro_Road_2\"] = abs(\n",
    "    test_df[\"Horizontal_Distance_To_Hydrology\"]\n",
    "    - test_df[\"Horizontal_Distance_To_Roadways\"]\n",
    ")\n",
    "\n",
    "train_df[\"Fire_Road_1\"] = abs(\n",
    "    train_df[\"Horizontal_Distance_To_Fire_Points\"]\n",
    "    + train_df[\"Horizontal_Distance_To_Roadways\"]\n",
    ")\n",
    "test_df[\"Fire_Road_1\"] = abs(\n",
    "    test_df[\"Horizontal_Distance_To_Fire_Points\"]\n",
    "    + test_df[\"Horizontal_Distance_To_Roadways\"]\n",
    ")\n",
    "\n",
    "train_df[\"Fire_Road_2\"] = abs(\n",
    "    train_df[\"Horizontal_Distance_To_Fire_Points\"]\n",
    "    - train_df[\"Horizontal_Distance_To_Roadways\"]\n",
    ")\n",
    "test_df[\"Fire_Road_2\"] = abs(\n",
    "    test_df[\"Horizontal_Distance_To_Fire_Points\"]\n",
    "    - test_df[\"Horizontal_Distance_To_Roadways\"]\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "y = train_df[\"Cover_Type\"]\n",
    "train_df = train_df.drop([\"Cover_Type\", \"Id\"], axis=1)\n",
    "test_df = test_df.drop([\"Id\"], axis=1)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "y = y - 1  # Чтоб классы нумеровались от о до 6"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from hyperopt import STATUS_OK, Trials, fmin, hp, tpe"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def score(params):\n",
    "    from sklearn.metrics import log_loss\n",
    "\n",
    "    print(\"Training with params:\")\n",
    "    print(params)\n",
    "    params[\"max_depth\"] = int(params[\"max_depth\"])\n",
    "    dtrain = xgb.DMatrix(X_train, label=y_train)\n",
    "    dvalid = xgb.DMatrix(X_test, label=y_test)\n",
    "    model = xgb.train(params, dtrain, params[\"num_round\"])\n",
    "    predictions = model.predict(dvalid).reshape((X_test.shape[0], 7))\n",
    "    score = log_loss(y_test, predictions)\n",
    "    print(\"\\tScore {0}\\n\\n\".format(score))\n",
    "    return {\"loss\": score, \"status\": STATUS_OK}"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def optimize(trials):\n",
    "    space = {\n",
    "        \"num_round\": 100,\n",
    "        \"learning_rate\": hp.quniform(\"eta\", 0.005, 0.05, 0.005),\n",
    "        \"max_depth\": hp.quniform(\"max_depth\", 3, 14, 1),\n",
    "        \"min_child_weight\": hp.quniform(\"min_child_weight\", 1, 10, 1),\n",
    "        \"subsample\": hp.quniform(\"subsample\", 0.5, 1, 0.05),\n",
    "        \"gamma\": hp.quniform(\"gamma\", 0.5, 1, 0.01),\n",
    "        \"colsample_bytree\": hp.quniform(\"colsample_bytree\", 0.4, 1, 0.05),\n",
    "        \"num_class\": 7,\n",
    "        \"eval_metric\": \"merror\",\n",
    "        \"objective\": \"multi:softprob\",\n",
    "        \"nthread\": 4,\n",
    "        \"silent\": 1,\n",
    "    }\n",
    "\n",
    "    best = fmin(score, space, algo=tpe.suggest, trials=trials, max_evals=10)\n",
    "    return best"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "X_train, X_test, y_train, y_test = train_test_split(\n",
    "    train_df, y, test_size=0.3, random_state=17\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "trials = Trials()\n",
    "best_params = optimize(trials)\n",
    "best_params"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "best_params[\"max_depth\"] = int(best_params[\"max_depth\"])\n",
    "best_params[\"num_class\"] = 7\n",
    "best_params[\"eval_metric\"] = \"merror\"\n",
    "best_params[\"objective\"] = \"multi:softprob\"\n",
    "best_params[\"nthread\"] = 4\n",
    "best_params[\"silent\"] = 1"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "dtrain = xgb.DMatrix(train_df, y)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%time\n",
    "xgbCvResult = xgb.cv(\n",
    "    best_params, dtrain, num_boost_round=500, nfold=3, early_stopping_rounds=50\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "plt.plot(range(xgbCvResult.shape[0]), xgbCvResult[\"test-merror-mean\"])\n",
    "plt.plot(range(xgbCvResult.shape[0]), xgbCvResult[\"train-merror-mean\"]);"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "best_num_round = np.argmin(xgbCvResult[\"test-merror-mean\"])\n",
    "best_num_round"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "xgb.train?"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Сделаем прогноз для всей тестовой выборки.**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "bestXgb = xgb.train(best_params, dtrain, num_boost_round=best_num_round)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "dtest = xgb.DMatrix(test_df)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "xgboost_predict_proba = bestXgb.predict(dtest)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "xgboost_prediction = np.argmax(xgboost_predict_proba, axis=1)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Мы вычитали из целевых меток 1, теперь добавляем.**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "xgboost_prediction += 1"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "write_to_submission_file(xgboost_prediction, \"forest_cover_type_xgboost.csv\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**У такой посылки на Kaggle результат - 0.771.**"
   ]
  }
 ],
 "metadata": {
  "anaconda-cloud": {},
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.12"
  },
  "name": "ForestCoverType.ipynb"
 },
 "nbformat": 4,
 "nbformat_minor": 1
}
